We first need to load up the work that was previously done in deliverable 1.

include <- function(library_name){
  if( !(library_name %in% installed.packages()) )
    install.packages(library_name) 
  library(library_name, character.only=TRUE)
}
include("tidyverse")
include("rvest")
include("lubridate")
include("caret")
include("knitr")
include("BBmisc")
purl("deliverable1.Rmd", output = "part1.r")
source("part1.r")

Intro

In this deliverable I am going to scrape NBA game data from the web and use it to try and predict the outcomes of games.

Getting the data

The first thing that we need to do is scrape the data from the web.
### Note This section of code takes approximately two minutes to run.

monthList <- c("october", "november", "december", "january", 
               "february","march", "april", "may", "june")
yearList <- c(1990:2011)
tib <- data.frame()
boxUrls <- data.frame()
for(year in yearList) {
  for(month in monthList) {
    url <- paste0("https://www.basketball-reference.com/leagues/NBA_",year,"_games-",month,".html")
    page <- try(read_html(url),TRUE)
    if( !("try-error" %in% class(page))){
    colNames <- page %>%
      html_nodes("table#schedule > thead > tr > th") %>%
      html_attr("data-stat")
    if(length(colNames) != 0){
    dates <- page %>%
      html_nodes("table#schedule >tbody >tr >th") %>%
      html_text()
    
    boxLink <- page %>%
      html_nodes("table#schedule >tbody >tr >td.center > a") %>%
      html_attr("href")
    dates <- dates[dates != "Playoffs"]
    
    scrapedData <- page %>%
      html_nodes("table#schedule > tbody > tr > td") %>%
      html_text() %>%
    matrix(ncol = length(colNames)-1,byrow=TRUE)
    
    month_tib <- as.data.frame(cbind(dates,scrapedData),stringAsFactors=FALSE)
    names(month_tib) <- colNames
    tib <- bind_rows(tib,month_tib)
    
    box_tib <- as.data.frame(boxLink,stringAsFactors=FALSE)
    boxUrls <- bind_rows(boxUrls,box_tib)
    
    }
    }
  }
}

We also want to make sure that the data we just scraped is in the correct type.

  tib$home_pts <- as.numeric(tib$home_pts)
  tib$visitor_pts <- as.numeric(tib$visitor_pts)
  tib$attendance  <- as.numeric(gsub(",", "",
                                     tib$attendance))
  tib$date_game <- mdy(tib$date_game)

After obtaining all of the data, we now want to organize it in a more condensed table. We are going to create a new table containing: - date: in the format year-month-day that the game took place at
- home_team: the full name of the home team in the corresponding game
- home_score: the number of points scored by the home team in the corresponding game
- away_team: the full name of teh away team in the corresponding game
- away_score: the number of points scored by the home team in the corresonding game
- overtimes: distinguishes if the game reached overtime and if so, how many. (ex. OT = 1 overtime, 2OT = 2 overtimes and so on)
- attendance: the number of people in attendance for teh corresponding game
- start time: the start time of the corresponding game formatted as hours:minutes

schedule <- tibble(date=tib$date_game,
                   home_team=tib$home_team_name,
                   home_score=tib$home_pts,
                   away_team=tib$visitor_team_name,
                   away_score=tib$visitor_pts,
                   overtimes=tib$overtimes,
                   attendance=tib$attendance,
                   start_time=tib$game_start_time)
schedule

In order to make observations easier, I am going to add each teams’ abbreviation to the schedule table. To do this, I first need to create a table with each name and abbreviation matched, which I can do using the ranking table.

ranking <- arrange(ranking,year)
abbrevs <- tibble(abbrev=ranking$team_abbrev,home_team=ranking$team_name,num = 1:nrow(ranking))
abbrevs$home_team<- as.factor(abbrevs$home_team)
name_levels <- levels(abbrevs$home_team)
levels(abbrevs$home_team)[abbrevs$home_team=="Seattle Supersonics"] <- "Seattle SuperSonics"

name_levels <- levels(abbrevs$home_team)

for(t in name_levels){
  first_occur <- max(abbrevs$num[abbrevs$home_team==t])
  abbrevs <- abbrevs %>% filter(!(home_team==t & num !=first_occur))
}

home_abbrev <- tibble(home_abbrev=abbrevs$abbrev,home_team=abbrevs$home_team)
away_abbrev <- tibble(away_abbrev=abbrevs$abbrev,away_team=abbrevs$home_team)


new_schedule <- schedule %>% 
                    left_join(home_abbrev, by="home_team")
new_schedule <- new_schedule %>% 
                    left_join(away_abbrev, by="away_team")
new_schedule

Due to almost half of the data not having values for overtimes, attendance and start time, I am going to remove those from the new_schedule table.

ind_game <- tibble(year=year(new_schedule$date),
                   home_abbrev=new_schedule$home_abbrev,
                   home_team=new_schedule$home_team,
                   home_score=new_schedule$home_score,
                   away_abbrev=new_schedule$away_abbrev,
                   away_team=new_schedule$away_team,
                   away_score=new_schedule$away_score
                   )
ind_game

Initial Model

For the initial model, I am going to try and predict total game score for the home team using the home team’s yearly made/attempted free throws, made/attempted shots,rebounds, assists, fouls, and points.

First though, we must add the yearly statistics to the ind_game table.

ind_game$year <- as.factor(ind_game$year)
new_game <- ind_game %>% 
                    left_join(statistics, by=c("home_abbrev"="team_abbrev","year"))
## Warning: Column `year` joining factors with different levels, coercing to
## character vector
new_game
colnames(new_game)[colnames(new_game)=="made_field_goal"] <- "home_yearly_made_field_goal"
colnames(new_game)[colnames(new_game)=="attempt_field_goal"] <- "home_yearly_attempt_field_goal"
colnames(new_game)[colnames(new_game)=="made_free_throw"] <- "home_yearly_made_free_throw"
colnames(new_game)[colnames(new_game)=="attempt_free_throw"] <- "home_yearly_attempt_free_throw"
colnames(new_game)[colnames(new_game)=="made_3_pointer"] <- "home_yearly_made_3_pointer"
colnames(new_game)[colnames(new_game)=="attempt_3_pointer"] <- "home_yearly_attempt_3_pointer"
colnames(new_game)[colnames(new_game)=="rebounds"] <- "home_yearly_rebounds"
colnames(new_game)[colnames(new_game)=="assists"] <- "home_yearly_assists"
colnames(new_game)[colnames(new_game)=="fouls"] <- "home_yearly_fouls"
colnames(new_game)[colnames(new_game)=="points_scored"] <- "home_yearly_points_scored"

This new_game table that we have created now contains all of the statistics we need to do our modeling. One thing that we need to do though in order for our models to not be skewed is to remove all of the rows with NA values. This will make our data tidy.

new_game <- drop_na(new_game)

new_game
simple_model <- lm(new_game, formula= home_score ~
                     home_yearly_made_field_goal+ 
                     home_yearly_attempt_field_goal+
                     home_yearly_made_free_throw+
                     home_yearly_attempt_free_throw+
                     home_yearly_rebounds+
                     home_yearly_assists+home_yearly_fouls+
                     home_yearly_points_scored+
                     home_yearly_made_3_pointer+
                     home_yearly_attempt_3_pointer)
summary(simple_model)
## 
## Call:
## lm(formula = home_score ~ home_yearly_made_field_goal + home_yearly_attempt_field_goal + 
##     home_yearly_made_free_throw + home_yearly_attempt_free_throw + 
##     home_yearly_rebounds + home_yearly_assists + home_yearly_fouls + 
##     home_yearly_points_scored + home_yearly_made_3_pointer + 
##     home_yearly_attempt_3_pointer, data = new_game)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -43.258  -8.288  -0.265   7.902  60.122 
## 
## Coefficients: (1 not defined because of singularities)
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    88.9517894  0.9031501  98.491  < 2e-16 ***
## home_yearly_made_field_goal     0.0714294  0.0081478   8.767  < 2e-16 ***
## home_yearly_attempt_field_goal -0.0068388  0.0005490 -12.456  < 2e-16 ***
## home_yearly_made_free_throw     0.0256056  0.0042535   6.020 1.77e-09 ***
## home_yearly_attempt_free_throw  0.0011495  0.0010868   1.058     0.29    
## home_yearly_rebounds           -0.0073446  0.0005876 -12.498  < 2e-16 ***
## home_yearly_assists            -0.0027153  0.0006216  -4.368 1.26e-05 ***
## home_yearly_fouls              -0.0083960  0.0005792 -14.497  < 2e-16 ***
## home_yearly_points_scored      -0.0205687  0.0038496  -5.343 9.22e-08 ***
## home_yearly_made_3_pointer             NA         NA      NA       NA    
## home_yearly_attempt_3_pointer   0.0074368  0.0015217   4.887 1.03e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.12 on 25270 degrees of freedom
## Multiple R-squared:  0.1268, Adjusted R-squared:  0.1265 
## F-statistic: 407.8 on 9 and 25270 DF,  p-value: < 2.2e-16
ggplot(new_game, aes(x=home_yearly_made_field_goal,y=home_score)) + geom_smooth(method=lm)

ggplot(new_game, aes(x=home_yearly_made_field_goal,y=home_yearly_assists)) + geom_point()

This model somewhat make sense but there are a few things that don’t seem to add up. Of course, as made field goals goes up, the offensive points will increase, but you’d think that as the amount of shots attempted went up, that point totals would go up, but they do not. Based on the summary of the model, it seems as though all of the variables provided were good predictors of score except attempted free throws. This seems a little weird because logically, the more free throws taken the more points you should score. One thing that may be influencing the output is that there is some missing data in the form of zeros for some of the earlier years when that stat was not kept track of. This model also doesn’t take into account the the opposing team and it’s defensive/offensive stats. That is something that I think can be looked at in the future when making a more accurate model.